96 research outputs found

    Redes neuronales auto-organizativas basadas en optimización funcional. Aplicación en bioinformática y biología computacional

    Unpublished doctoral thesis, Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Date of defense: 25-11-200

    Assessment of protein set coherence using functional annotations

    12 pages, 5 figures. PMID: 18937846 [PubMed]. PMCID: PMC2588600. Additional information available: File 1: Coherence score and significance measures of random sets. File 2: Functional analysis of 'Module 39' obtained by Pu et al. [37] using various approaches.
    [Background] Analysis of large-scale experimental datasets frequently produces one or more sets of proteins that are subsequently mined for functional interpretation and validation. To this end, a number of computational methods have been devised that rely on the analysis of functional annotations. Although current methods provide valuable information (e.g. significantly enriched annotations, pairwise functional similarities), they do not specifically measure the degree of homogeneity of a protein set.
    [Results] In this work we present a method that scores the degree of functional homogeneity, or coherence, of a set of proteins on the basis of the global similarity of their functional annotations. The method uses statistical hypothesis testing to assess the significance of the set in the context of the functional space of a reference set. As such, it can be used as a first step in the validation of sets expected to be homogeneous prior to further functional interpretation.
    [Conclusions] We evaluate our method by analysing known biologically relevant sets as well as random ones. The known relevant sets comprise macromolecular complexes, cellular components and pathways described for Saccharomyces cerevisiae, which are mostly significantly coherent. Finally, we illustrate the usefulness of our approach for validating ‘functional modules’ obtained from computational analysis of protein-protein interaction networks.
    Matlab code and supplementary data are available at: http://www.cnb.csic.es/~monica/coherence/
    This work has been partially funded by the Spanish grants BIO2007-67150-C03-02, S-Gen-0166/2006, CYTED-505PI0058, TIN2005-5619, PR27/05-13964-BSCH. APM acknowledges the support of the Spanish Ramón y Cajal program. Peer reviewed.
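The scoring-plus-significance idea in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not define the coherence score, so mean pairwise Jaccard similarity is used here as a stand-in, and the function names, the toy annotation table and the sampling parameters are all hypothetical rather than taken from the published Matlab code.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two annotation sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def coherence(proteins, annotations):
    """Mean pairwise annotation similarity; a stand-in score, not the paper's."""
    pairs = list(combinations(proteins, 2))
    return sum(jaccard(annotations[p], annotations[q]) for p, q in pairs) / len(pairs)


def empirical_p_value(proteins, annotations, n_random=1000, seed=0):
    """Fraction of random same-size sets scoring at least as high as the query set."""
    rng = random.Random(seed)
    reference = sorted(annotations)
    observed = coherence(proteins, annotations)
    hits = sum(
        coherence(rng.sample(reference, len(proteins)), annotations) >= observed
        for _ in range(n_random)
    )
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_random + 1)


# Hypothetical toy reference set: protein -> set of annotation terms.
annotations = {
    "P1": {"ribosome", "translation"},
    "P2": {"ribosome", "translation"},
    "P3": {"ribosome", "translation", "rRNA binding"},
    "P4": {"glycolysis"},
    "P5": {"DNA repair"},
    "P6": {"transport"},
    "P7": {"proteolysis"},
    "P8": {"cell cycle"},
}
query = ["P1", "P2", "P3"]
score = coherence(query, annotations)
p_value = empirical_p_value(query, annotations)
```

A low `p_value` suggests that the query set is more homogeneous than random sets of the same size drawn from the reference set, which is the kind of first-pass check the abstract describes.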

    Moara: a Java library for extracting and normalizing gene and protein mentions

    Background: Gene/protein recognition and normalization are important preliminary steps for many biological text mining tasks, such as information retrieval, protein-protein interactions, and extraction of semantic information, among others. Although these problems have received considerable attention and effective solutions have been reported, easily integrated tools for these tasks are not readily available.
    Results: This study proposes a versatile and trainable Java library that implements gene/protein tagging and normalization steps based on machine learning approaches. The system has been trained for several model organisms and corpora but can be expanded to support new organisms and documents.
    Conclusions: Moara is a flexible, trainable and open-source system that is not specifically oriented to any organism and therefore does not require specific tuning of the algorithms or dictionaries used. Moara can be used as a stand-alone application or can be incorporated in the workflow of a more general text mining system.

    Biclustering of gene expression data by non-smooth non-negative matrix factorization

    BACKGROUND: The widespread use of microarray technologies has enabled the generation and accumulation of gene expression datasets that contain expression levels of thousands of genes across tens or hundreds of different experimental conditions. One of the major challenges in the analysis of such datasets is to discover local structures composed of sets of genes that show coherent expression patterns across subsets of experimental conditions. These patterns may provide clues about the main biological processes associated with different physiological states. RESULTS: In this work we present a methodology able to cluster genes and conditions that are highly related in sub-portions of the data. Our approach is based on a new data mining technique, Non-smooth Non-Negative Matrix Factorization (nsNMF), able to identify localized patterns in large datasets. We assessed the potential of this methodology by analyzing several synthetic datasets as well as two large and heterogeneous sets of gene expression profiles. In all cases the method was able to identify localized features related to sets of genes that show consistent expression patterns across subsets of experimental conditions. The uncovered structures showed a clear biological meaning in terms of relationships among functional annotations of genes and the phenotypes or physiological states of the associated conditions. CONCLUSION: The proposed approach can be a useful tool to analyze large and heterogeneous gene expression datasets. The method is able to identify complex relationships among genes and conditions that are difficult to identify by standard clustering algorithms.
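To make the factorization concrete, here is a minimal pure-Python sketch of a non-smooth NMF, assuming the commonly cited model V ≈ W S H with smoothing matrix S = (1 - θ)I + (θ/k)11ᵀ. The abstract does not state the update rules, so this sketch swaps in Lee-Seung-style multiplicative updates for squared error; the function names, the value of `theta` and the toy matrix are hypothetical and not taken from the paper.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def update(target, numer, denom, eps=1e-9):
    """Element-wise multiplicative update: target * numer / denom."""
    return [[t * n / (d + eps) for t, n, d in zip(tr, nr, dr)]
            for tr, nr, dr in zip(target, numer, denom)]


def squared_error(v, w, s, h):
    """Sum of squared differences between v and the product W S H."""
    approx = matmul(matmul(w, s), h)
    return sum((x - y) ** 2 for vr, ar in zip(v, approx) for x, y in zip(vr, ar))


def ns_nmf(v, k, theta=0.5, n_iter=200, seed=0):
    """Factor v (genes x conditions) as W S H with non-negative W and H."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    # Smoothing matrix S = (1 - theta) * I + (theta / k) * ones.
    s = [[(1 - theta) * (i == j) + theta / k for j in range(k)] for i in range(k)]
    start = squared_error(v, w, s, h)
    for _ in range(n_iter):
        ws = matmul(w, s)                      # update H with W S held fixed
        wst = transpose(ws)
        h = update(h, matmul(wst, v), matmul(matmul(wst, ws), h))
        sht = transpose(matmul(s, h))          # update W with S H held fixed
        w = update(w, matmul(v, sht), matmul(matmul(w, transpose(sht)), sht))
    return w, h, start, squared_error(v, w, s, h)


# Hypothetical expression matrix with two blocks of co-expressed genes.
v = [
    [5, 5, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 4, 4],
    [0, 0, 4, 4],
    [0, 0, 4, 4],
]
w, h, start_error, end_error = ns_nmf(v, k=2)
```

In a biclustering reading, each column of `w` weights genes and the matching row of `h` weights conditions for one factor; larger `theta` values push more smoothing into S, which in turn forces sparser W and H to compensate.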

    Functional Analysis beyond Enrichment: Non-Redundant Reciprocal Linkage of Genes and Biological Terms

    Functional analysis of large sets of genes and proteins is becoming increasingly necessary as experimental biomolecular data accumulate at omic scale. Enrichment analysis is by far the most popular methodology for deriving functional implications of sets of cooperating genes. The problem with these techniques lies in the redundancy of the resulting information, which in most cases produces many trivial results that risk masking key biological events. We present and describe a computational method, called GeneTerm Linker, that filters and links enriched output data, identifying sets of associated genes and terms and producing metagroups of coherent biological significance. The method uses fuzzy reciprocal linkage between genes and terms to unravel their functional convergence and associations. The algorithm is tested with a small set of well-known interacting proteins from yeast and with a large collection of reference sets from three heterogeneous resources: multiprotein complexes (CORUM), cellular pathways (SGD) and human diseases (OMIM). Precision, recall and balanced F-score are calculated, showing robust results even when different levels of random noise are included in the test sets. Although we could not find an equivalent method, we present a comparative analysis with a widely used method that combines enrichment and functional annotation clustering. A web application implementing the proposed method is provided at http://gtlinker.cnb.csic.es
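The evaluation measures named in this abstract follow standard definitions, which can be sketched as below. The gene names and the pairing of a recovered metagroup with a known complex are hypothetical; the abstract does not say how predicted groups were matched to reference sets.

```python
def precision_recall_f(predicted, reference):
    """Precision, recall and balanced F-score of a predicted set against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: a recovered metagroup versus a known complex.
predicted = {"geneA", "geneB", "geneC", "geneD"}
reference = {"geneA", "geneB", "geneC", "geneE", "geneF"}
precision, recall, f_score = precision_recall_f(predicted, reference)
```

Here three of the four predicted genes belong to the reference set, so precision is 0.75 and recall is 0.6; the balanced F-score is their harmonic mean.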

    Discovering semantic features in the literature: a foundation for building functional associations

    BACKGROUND: Experimental techniques such as DNA microarray, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As in any other experimental approach, it is necessary to analyze these data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research. RESULTS: We present a method to create literature profiles for large sets of genes or proteins based on common semantic features extracted from a corpus of relevant documents. These profiles can be used to establish pair-wise similarities among genes, used in gene/protein classification, or even combined with experimental measurements. Semantic features can be used by researchers to facilitate the understanding of the commonalities indicated by experimental results. Our approach is based on non-negative matrix factorization (NMF), a machine-learning algorithm for data analysis, capable of identifying local patterns that characterize a subset of the data. The literature is thus used to establish putative relationships among subsets of genes or proteins and to provide coherent justification for this clustering into subsets. We demonstrate the utility of the method by applying it to two independent and vastly different sets of genes. CONCLUSION: The presented method can create literature profiles from documents relevant to sets of genes. The representation of genes as additive linear combinations of semantic features allows for the exploration of functional associations as well as for clustering, suggesting a valuable methodology for the validation and interpretation of high-throughput experimental data.

    GENECODIS: a web-based tool for finding significant concurrent annotations in gene lists

    We present GENECODIS, a web-based tool that integrates different sources of information to search for annotations that frequently co-occur in a set of genes and rank them by statistical significance. The analysis of concurrent annotations provides significant information for the biologic interpretation of high-throughput experiments and may outperform the results of standard methods for the functional analysis of gene lists. GENECODIS is publicly available at

    Integrated analysis of gene expression by association rules discovery

    BACKGROUND: Microarray technology is generating huge amounts of data about the expression level of thousands of genes, or even whole genomes, across different experimental conditions. To extract biological knowledge, and to fully understand such datasets, it is essential to include external biological information about genes and gene products in the analysis of expression data. However, most current approaches to analyzing microarray datasets focus mainly on the experimental data, and external biological information is incorporated only afterwards, as a post-processing step. RESULTS: In this study we present a method for the integrative analysis of microarray data based on the Association Rules Discovery data mining technique. The approach integrates gene annotations and expression data to discover intrinsic associations among both data sources based on co-occurrence patterns. We applied the proposed methodology to the analysis of gene expression datasets in which genes were annotated with metabolic pathways, transcriptional regulators and Gene Ontology categories. Automatically extracted associations revealed significant relationships among these gene attributes and expression patterns, many of which are clearly supported by recently reported work. CONCLUSION: The integration of external biological information and gene expression data can provide insights about the biological processes associated with gene expression programs. In this paper we show that the proposed methodology is able to integrate multiple gene annotations and expression data in the same analytic framework and extract meaningful associations among heterogeneous sources of data. An implementation of the method is included in the Engene software package.
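The co-occurrence idea behind association rules can be illustrated with a small support/confidence computation. This is a simplified sketch restricted to rules with one item on each side, and it is not the Engene implementation: the function name, the thresholds and the toy gene records (annotations mixed with discretized expression labels) are hypothetical.

```python
from itertools import combinations


def mine_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return (antecedent, consequent, support, confidence) for one-to-one rules."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def support(itemset):
        # Fraction of transactions that contain every item in itemset.
        return sum(itemset <= t for t in transactions) / n

    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


# Hypothetical genes, each described by annotations plus discretized expression.
genes = [
    {"GO:ribosome", "up@heat_shock"},
    {"GO:ribosome", "up@heat_shock"},
    {"GO:ribosome", "up@heat_shock", "pathway:glycolysis"},
    {"pathway:glycolysis"},
]
rules = mine_rules(genes)
```

Treating each gene as a transaction that mixes annotation items and expression items is what lets a single rule link, for example, a Gene Ontology category to an expression pattern in one pass.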

    Comparison of molecular dynamics and superfamily spaces of protein domain deformation

    Background: The strong relationship between protein structure and flexibility, on the one hand, and biological protein function, on the other, is well known. Technically, protein flexibility exploration is an essential task in many applications, such as protein structure prediction and modeling. In this contribution we have compared two different approaches to explore the flexibility space of protein domains: i) molecular dynamics (MD-space), and ii) the study of the structural changes within a superfamily (SF-space).
    Results: Our analysis indicates that the MD-space and the SF-space display a significant overlap, but are still different enough to be considered complementary. The SF-space is wider but less complex than the MD-space, irrespective of the number of members in the superfamily. Also, the SF-space does not sample all possibilities offered by the MD-space, but often introduces very large changes along just a few deformation modes, whose number tends to a plateau as the number of related folds in the superfamily increases.
    Conclusion: Theoretically, we obtained two conclusions. First, function restricts evolution's access to some flexibility patterns: we observe that when a superfamily member changes to become another, the path does not completely overlap with the physical deformability. Second, conformational changes arising from variation within a superfamily are larger and much simpler than those allowed by physical deformability. Methodologically, the conclusion is that both spaces studied are complementary, and have different size and complexity. We expect this fact to have application in fields such as 3D-EM/X-ray hybrid models or ab initio protein folding.

    Optimization problems in electron microscopy of single particles

    The final publication is available at Springer via http://dx.doi.org/10.1007/s10479-006-0078-8
    Electron Microscopy is a valuable tool for the elucidation of the three-dimensional structure of macromolecular complexes. Knowledge about the macromolecular structure provides important information about its function and how it is carried out. This work addresses the issue of three-dimensional reconstruction of biological macromolecules from electron microscopy images. In particular, it focuses on a methodology known as “single-particles” and makes a thorough review of all those steps that can be expressed as an optimization problem. In spite of important advances in recent years, there are still unresolved challenges in the field that offer an excellent testbed for new and more powerful optimization techniques.
    We acknowledge partial support from the “Comunidad Autónoma de Madrid” through grants CAM-07B-0032-2002, GR/SAL/0653/2004 and GR/SAL/0342/2004, the “Comisión Interministerial de Ciencia y Tecnología” of Spain through grants BIO2001-1237, BIO2001-4253-E, BIO2001-4339-E, BIO2002-10855-E, BFU2004-00217/BMC, the Spanish FIS grant (G03/185), the European Union through grants QLK2-2000-00634, QLRI-2000-31237, QLRT-2000-0136, QLRI-2001-00015, FP6-502828 and the NIH through grant HL70472. Alberto Pascual and Roberto Marabini acknowledge support by the Spanish Ramón y Cajal Program.